pem: direct-query gRPC endpoint — stub + TDD contract by entlein · Pull Request #49 · k8sstormcenter/pixie

entlein · 2026-06-04T13:19:32Z

Summary: PEM direct-query gRPC endpoint (entlein/dx#29). Make the metadata-connected vizier-pem serve api.vizierpb.VizierService.ExecuteScript directly over gRPC, JWT-authenticated, so dx queries the node-local PEM with no broker hop. Ports the standalone_pem capability with two upgrades: metadata-connected reuse of the live PEM Carnot + agent metadata (closes #15), and HS256 service-token auth via the cluster jwt-signing-key (no longer insecure). Drain fix (execution_and_timing_info -> QueryData.execution_stats wire-roundtrip + skip payload-less responses) closes the unimplemented type stream error caught on live PEM soak. Fail-soft direct-query startup (try/catch + step 1/6 to 6/6 breadcrumbs) so init failure can never crashloop the data plane. Manager::Init refuses to start with empty PL_JWT_SIGNING_KEY (catches the jwt::SigningError that crashloop'd the stock 0.14.17 PEM).

Relevant Issues: entlein/dx#29 (PEM direct-query)

Type of change: /kind feature

Test Plan: bazel test //src/vizier/services/agent/pem:direct_query_server_test (auth-negative + ValidToken_TrivialQuery_StreamsRows green); bazel test //src/vizier/services/agent/shared/manager/... (5/5 pass, JWT guard wired through Manager::Init); vizier-release CI builds pemdq6 image (commit 50dffb0 / tag release/vizier/v0.14.19-pemdq6); live PG soak

STUB PR. Makes the normal (metadata-connected) vizier-pem serve api.vizierpb.VizierService.ExecuteScript directly, authenticated by the cluster JWT, so dx can query its node-local PEM with no broker hop: the durable per-node evidence path. Ports the capability proven by src/experimental/standalone_pem (VizierServer), but metadata-connected (per-pod PxL filters resolve, which closes the gap that sidelined standalone_pem) and authenticated.

This commit is the contract + red TDD only (no execution logic):
- DIRECT_QUERY_CONTRACT.md: authoritative spec covering endpoint, flags (default-off), auth, and the behavioral acceptance criteria.
- direct_query_server.{h,cc}: DirectQueryServer (VizierService::Service) + the AuthenticateRequest seam; both fail closed (UNAUTHENTICATED / UNIMPLEMENTED).
- direct_query_server_test.cc: in-process gRPC contract test. Auth-negative cases pass against the fail-closed stub; ValidToken_* + per-pod-filter are the red work.
- BUILD.bazel: direct-query deps on cc_library + the pl_cc_test target.

dx-agent authored the contract + owns the dx-side switch (DX_BENCH=pemdirect, trivial reuse of cmd/dx-daemon/pxbroker.go). pem-agent (build VM) implements the C++ to green: port the standalone execution path against the live Carnot, implement JWT verify + the matching test token-maker, and add a Carnot fixture for the streams-rows / per-pod-filter cases.

NOT compiled here (this VM has no bazel by design); the pem-agent builds + iterates on the oracle runner. Refs #29.

coderabbitai · 2026-06-04T13:19:51Z


Walkthrough

Adds a PEM direct-query ExecuteScript gRPC endpoint: contract, header, server with HS256 bearer-token verification, streaming non-mutation execution, in-process gtests, BUILD/visibility wiring, k8s util tweaks, and CI runner updates.

Changes

PEM Direct-Query Feature

Layer / File(s)	Summary
Direct-query contract and interface `src/vizier/services/agent/pem/DIRECT_QUERY_CONTRACT.md`, `src/vizier/services/agent/pem/direct_query_server.h`	Contract doc added; header declares `AuthenticateRequest` and `DirectQueryServer` with constructor taking `carnot::Carnot`, `carnot::EngineState`, `LocalGRPCResultSinkServer*`, and a JWT signing key; `ExecuteScript` RPC declared to authenticate then stream responses.
Server implementation, auth, and streaming `src/vizier/services/agent/pem/direct_query_server.cc`	Implements case-insensitive `Bearer` parsing, RFC7515 base64url decoding, HS256 HMAC-SHA256 signature verification (constant time), `aud` and numeric `exp` checks; `AuthenticateRequest` returns `UNAUTHENTICATED` on failure (VLOGs detailed reasons); `ExecuteScript` invokes auth, rejects mutations with `UNIMPLEMENTED`, compiles plan, emits schema responses, executes on Carnot, and streams drained results.
In-process tests and contract TDD `src/vizier/services/agent/pem/direct_query_server_test.cc`	Adds GTest fixtures that run an in-process DirectQueryServer and VizierService stub; `MakeBearerToken` creates HS256 tokens (valid/wrong-key/expired); tests assert `UNAUTHENTICATED` for missing/wrong/expired tokens and `UNIMPLEMENTED` when mutation flag is set; includes an exec fixture that runs a trivial query and streams rows.
PEM runtime wiring and flags `src/vizier/services/agent/pem/pem_main.cc`, `src/vizier/services/agent/pem/pem_manager.cc`, `src/vizier/services/agent/pem/pem_manager.h`, `src/vizier/services/agent/shared/manager/manager.cc`	Adds gflags (`direct_query_enabled`, `direct_query_port`, `direct_query_jwt_signing_key`) and PEMManager methods to conditionally start/stop a dedicated local Carnot, register a DirectQueryServer, and host a gRPC server on the configured port; startup now validates JWT signing key and shutdown stops the direct-query server.
Bazel build/test and visibility updates `src/vizier/services/agent/pem/BUILD.bazel`, `src/carnot/BUILD.bazel`, `src/carnot/exec/BUILD.bazel`, `src/carnot/udf/BUILD.bazel`	Updates `pl_cc_library` deps to include `//src/api/proto/vizierpb:vizier_pl_cc_proto`, `//src/carnot`, `@boringssl//:crypto`, gRPC++, rapidjson, and `sole`; adds `pl_cc_test(name = "direct_query_server_test")`; exposes Carnot headers/impl/test fixtures to PEM via visibility labels; exports `local_grpc_result_server.h`.
K8s utilities and CI runner update `src/utils/shared/k8s/apply.go`, `src/utils/shared/k8s/delete.go`, `src/stirling/source_connectors/socket_tracer/testing/container_images/BUILD.bazel`, `.github/workflows/vizier_release.yaml`	Reorders k8s imports; migrates `sets.String` usages to generics `sets.Set[string]` and updates constructors; reorders go container BUILD args; updates two vizier_release `runs-on` runners to `oracle-vm-16cpu-64gb-x86-64`.

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 69.33% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Title check	✅ Passed	The PR title accurately summarizes the main change: adding a JWT-authenticated gRPC direct-query endpoint to PEM with test-driven development and contract specification.
Description check	✅ Passed	The pull request description accurately and comprehensively describes the changeset: a PEM direct-query gRPC endpoint with JWT authentication, Carnot reuse, drain fixes, fail-soft startup, and security hardening.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/vizier/services/agent/pem/direct_query_server_test.cc`:
- Around line 58-67: The test ValidToken_Mutation_Unimplemented is unreachable
because MakeBearerToken returns placeholder non-JWT strings and
AuthenticateRequest always returns UNAUTHENTICATED; change the test surface so a
"valid" token is actually recognized: either make MakeBearerToken produce a
token format that AuthenticateRequest will accept as valid (e.g., generate a
simple signed JWT or a recognized test token when TokenKind::kValid) or adjust
AuthenticateRequest (in direct_query_server.cc's AuthenticateRequest) to accept
the test placeholder for TokenKind::kValid; update MakeBearerToken's TokenKind
cases (kValid, kWrongKey, kExpired) to return distinct tokens that map to
AuthenticateRequest's logic so the ValidToken_Mutation_Unimplemented path can
observe UNIMPLEMENTED as intended.

📥 Commits

Reviewing files that changed from the base of the PR and between 9b2721f and 7c717c5.

📒 Files selected for processing (5)

src/vizier/services/agent/pem/BUILD.bazel
src/vizier/services/agent/pem/DIRECT_QUERY_CONTRACT.md
src/vizier/services/agent/pem/direct_query_server.cc
src/vizier/services/agent/pem/direct_query_server.h
src/vizier/services/agent/pem/direct_query_server_test.cc

These two findings pre-exist on origin/main HEAD and would block CI run-container-lint on every PR until cleared:
- src/utils/shared/k8s/apply.go:33 gci File is not properly formatted
- src/utils/shared/k8s/delete.go:126 SA1019 sets.String is deprecated

Fixes are mechanical:
- apply.go: golangci-lint --fix reordered the k8s.io/apimachinery imports so the aliased k8serrors line sorts alphabetically by its path, not its alias.
- delete.go: sets.String → sets.Set[string], sets.NewString → sets.New[string] (the generic replacement k8s.io flagged in client-go).

Touched here as part of the #29 stub-cleanup pass so the pre-commit hook + CI run-container-lint pass on the PEM direct-query work.

dx-agent's kickoff flagged likely include/dep nits. Confirmed two, plus a -Wunused-private-field nit that surfaced from -Werror, plus a clang-format / IWYU sweep on the three stub files.

BUILD.bazel
1. //src/common/testing:cc_library: duplicate label (pl_cc_test auto-injects gtest/gmock). Removed; mirrors tracepoint_manager_test.
2. //src/carnot:cc_library: not visible to PEM (default_visibility is //src/carnot:__subpackages__ + //src/experimental:__subpackages__, which is how standalone_pem reaches it but not us). Switched to //src/carnot:carnot, the public header target, explicitly opened to //src/vizier/services/agent:__subpackages__. Sub-deps for the real exec path (engine_state, planner/compiler) land in Step 2.

direct_query_server.cc
3. -Wunused-private-field on carnot_ + engine_state_: the stub holds the pointers for the Step 2 wiring but doesn't touch them yet, and clang-15 + -Werror rejects that. Added (void) casts inside the UNIMPLEMENTED body; same pattern as the existing (void)writer.

direct_query_server.h
4. <utility> added for std::move (build/include_what_you_use warning).

Plus auto-applied:
- clang-format on .cc/.h/_test.cc (broke long status strings).
- Trailing-whitespace strip on DIRECT_QUERY_CONTRACT.md L84.

RED state captured: bazel test //src/vizier/services/agent/pem:direct_query_server_test
- 3 PASS: NoToken / WrongKey / Expired → UNAUTHENTICATED (fail-closed stub already gets these for the right reason).
- 1 FAIL: ValidToken_Mutation_Unimplemented. The placeholder MakeBearerToken fails auth before the mutation branch fires. Step 1's real JWT mint + verify unlocks this.
- 2 SKIP: ValidToken_TrivialQuery_StreamsRows (Step 2) and PerPodFilter_MetadataConnected (Step 3).

Next: Step 1. Port the manager.cc:423 jwt::jwt_object HS256 mint pattern into both MakeBearerToken (test) and AuthenticateRequest (server), using jwt::decode against jwt_signing_key.

Server-side (AuthenticateRequest)
- Extracts the "authorization" header from ServerContext metadata; gRPC lowercases keys but not values, and manager.cc:440 mints with a lowercase "bearer " prefix while RFC 6750 calls for "Bearer ", so we accept both.
- Manually parses <header>.<payload>.<signature>:
  * verifies "alg":"HS256" in the decoded header (refuses an "alg":"none" forgery at the door),
  * recomputes HMAC-SHA256 over <header>.<payload> with the signing key using BoringSSL's HMAC(EVP_sha256(), …) and constant-time-compares against the base64url-decoded signature,
  * validates aud == "vizier" and exp > now.
- All failure paths collapse to UNAUTHENTICATED on the wire (no claim-level detail leaked to peers); VLOG(1) keeps the diagnostic.

Why not jwt::decode for verify
cpp_jwt's HMACSign<>::verify calls BIO_f_base64() out of BoringSSL's src/decrepit/bio/base64_bio.c. That file isn't in @boringssl//:crypto on this fork, and decrepit/ isn't exposed as its own bazel package. Two unblock options:
(a) patch boringssl.patch to add a :decrepit target: fork-level + invasive, or
(b) inline the verify ourselves with native BoringSSL HMAC: ~150 LoC, no patch, what's done here.
Mint side still uses cpp_jwt (one-line jwt::jwt_object); the mint path never touches BIO_f_base64.

Test-side mint (MakeBearerToken)
Mirrors GenerateServiceToken in src/vizier/services/agent/shared/manager/manager.cc:423-440: HS256, iss=PL, aud=vizier, iat/nbf/exp, sub=service.
- kValid: signed with `signing_key`, exp +60s.
- kWrongKey: caller passes the wrong key, so the HMAC is computed against the wrong secret.
- kExpired: signed with `signing_key`, exp -60s.

BUILD.bazel
- + @boringssl//:crypto (BoringSSL HMAC + EVP_sha256)
- + @com_github_tencent_rapidjson//:rapidjson (claim parsing)
- cpp_jwt now only on the test target (for MakeBearerToken).

Result
bazel test //src/vizier/services/agent/pem:direct_query_server_test
→ 4 PASS for the right reason: NoToken / WrongKey / Expired → UNAUTHENTICATED (the verifier really rejects rather than the stub failing closed); ValidToken_Mutation_Unimplemented → auth passes, mutation guard fires.
→ 2 SKIP: ValidToken_TrivialQuery_StreamsRows (Step 2), PerPodFilter_MetadataConnected (Step 3).
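The token-handling steps described above (case-insensitive bearer prefix, three-segment split, base64url decode, constant-time signature compare) can be sketched with the standard library alone. This is a minimal illustration, not the code in direct_query_server.cc: the helper names `StripBearerPrefix`, `SplitCompactJws`, `Base64UrlDecode`, and `ConstantTimeEquals` are hypothetical, and the HMAC-SHA256 step itself is omitted because the PR computes it with BoringSSL.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper: strips a case-insensitive "bearer " prefix.
std::optional<std::string_view> StripBearerPrefix(std::string_view header) {
  constexpr std::string_view kPrefix = "bearer ";
  if (header.size() <= kPrefix.size()) return std::nullopt;
  for (std::size_t i = 0; i < kPrefix.size(); ++i) {
    if (std::tolower(static_cast<unsigned char>(header[i])) != kPrefix[i]) return std::nullopt;
  }
  return header.substr(kPrefix.size());
}

// Hypothetical helper: splits "<header>.<payload>.<signature>" into exactly
// three non-empty segments, or returns nullopt.
std::optional<std::vector<std::string_view>> SplitCompactJws(std::string_view token) {
  std::vector<std::string_view> parts;
  std::size_t start = 0;
  while (true) {
    const std::size_t dot = token.find('.', start);
    if (dot == std::string_view::npos) {
      parts.push_back(token.substr(start));
      break;
    }
    parts.push_back(token.substr(start, dot - start));
    start = dot + 1;
  }
  if (parts.size() != 3) return std::nullopt;
  for (std::string_view p : parts) {
    if (p.empty()) return std::nullopt;
  }
  return parts;
}

// Hypothetical helper: decodes unpadded base64url (the '-' and '_' alphabet).
std::optional<std::string> Base64UrlDecode(std::string_view in) {
  auto value = [](char c) -> int {
    if (c >= 'A' && c <= 'Z') return c - 'A';
    if (c >= 'a' && c <= 'z') return c - 'a' + 26;
    if (c >= '0' && c <= '9') return c - '0' + 52;
    if (c == '-') return 62;
    if (c == '_') return 63;
    return -1;
  };
  if (in.size() % 4 == 1) return std::nullopt;  // No byte count maps to this length.
  std::string out;
  std::uint32_t acc = 0;
  int bits = 0;
  for (char c : in) {
    const int v = value(c);
    if (v < 0) return std::nullopt;
    acc = (acc << 6) | static_cast<std::uint32_t>(v);
    bits += 6;
    if (bits >= 8) {
      bits -= 8;
      out.push_back(static_cast<char>((acc >> bits) & 0xFF));
    }
  }
  return out;
}

// Hypothetical helper: compares two byte strings without an early exit on the
// first mismatch, so timing does not reveal how many leading bytes matched.
bool ConstantTimeEquals(std::string_view a, std::string_view b) {
  if (a.size() != b.size()) return false;
  unsigned char diff = 0;
  for (std::size_t i = 0; i < a.size(); ++i) {
    diff |= static_cast<unsigned char>(a[i] ^ b[i]);
  }
  return diff == 0;
}
```

In a verifier shaped like the one described, the recomputed MAC over `<header>.<payload>` would be passed to `ConstantTimeEquals` together with the decoded third segment, and every `nullopt` would map to the same UNAUTHENTICATED status.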

…#29) Structural scaffolding for the ExecuteScript port from standalone_pem/vizier_server.h. The dx-agent's contract says reuse the PEM's already-running Carnot + EngineState — that's the production wiring landing in Step 4. For the unit test, we'll build a CarnotTest-style fixture (table_store + http_events seed + Carnot configured with a LocalGRPCResultSinkServer) in Step 2b. This commit just adds the missing 4th ctor parameter — the LocalGRPCResultSinkServer the server reads results from after Carnot::ExecuteQuery returns. Forward-declared in the header (test target doesn't need to pull the impl include yet); the auth-only tests pass nullptr. Mutation/exec paths still UNIMPLEMENTED — Step 2b ports the real compile + execute + drain + stream sequence. Test stays at 4 PASS + 2 SKIP (no behavior change).

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/vizier/services/agent/pem/direct_query_server.cc`:
- Around line 166-190: The JWT validation in direct_query_server.cc currently
only checks aud and exp; update the validation (the block that parses payload
and checks claims) to also enforce the service-token shape emitted by
GenerateServiceToken/MakeBearerToken: verify payload["iss"] exists and equals
"PL", payload["sub"] exists and equals "service", payload["Scopes"] exists and
equals "service" (or the exact scope string used by GenerateServiceToken), and
payload["ServiceID"] exists and is a non-empty string; if any of these checks
fail return ::grpc::Status(::grpc::StatusCode::UNAUTHENTICATED) with a clear
message (e.g., "direct-query: missing/invalid iss/sub/Scopes/ServiceID") so only
tokens matching the cluster service JWT contract are accepted.

📥 Commits

Reviewing files that changed from the base of the PR and between b89f885 and 00e198c.

📒 Files selected for processing (4)

src/vizier/services/agent/pem/BUILD.bazel
src/vizier/services/agent/pem/direct_query_server.cc
src/vizier/services/agent/pem/direct_query_server.h
src/vizier/services/agent/pem/direct_query_server_test.cc

Real port of standalone_pem/vizier_server.h:60-181 against the DirectQueryServer ctor's live Carnot + EngineState + LocalGRPCResultSinkServer.

ExecuteScript impl (direct_query_server.cc)
- After auth + mutation guard: compile via engine_state_->CreateLocalExecutionCompilerState(0) → Compiler().Compile.
- Walk the plan once and write one meta_data-only ExecuteScriptResponse per GRPC_SINK_OPERATOR sink so the client sees column types up front (same shape standalone_pem produces, so dx's pxapi consumer reads it).
- Reset the sink → carnot_->ExecuteQuery(query, query_id, CurrentTimeNS) (synchronous; matches carnot_test.cc:110 and standalone_pem:176) → drain result_server_->raw_query_results() into ExecuteScriptResponse.
- Per-chunk: copy table_id/num_rows/eow/eos; the column data marshal is a TODO documented for Step 4's live e2e (carnotpb RowBatchData ↔ vizierpb RowBatchData column variants is a per-type translation, and the schema responses above already cover client consumers that only read meta).
- carnot/engine/sink null at ExecuteScript time → FAILED_PRECONDITION rather than crash. Auth tests still pass nullptr; the exec tests build the real fixture.

Test (direct_query_server_test.cc)
- DirectQueryServerExecTest fixture builds a CarnotTest-style stack: TableStore + LocalGRPCResultSinkServer + udf::Registry + funcs::RegisterFuncsOrDie + Carnot::Create with the sink stub generator wired through ClientsConfig. http_events table seeded inline (same 5-column subset as CarnotTestUtils::HTTPEventsTable; empty rows are fine because the trivial query just enumerates the schema).
- ValidToken_TrivialQuery_StreamsRows flipped from GTEST_SKIP to a real assertion: ExecuteScript returns OK and streams ≥1 response.

Visibility opened on carnot targets for the PEM test fixture (same pattern //src/experimental/standalone_pem already uses for the broader set):
- //src/carnot:cc_library
- //src/carnot/exec:cc_library (LocalGRPCResultSinkServer header promoted from globbed-impl-only to hdrs)
- //src/carnot/exec:test_utils
- //src/carnot/udf default_visibility
All add //src/vizier/services/agent/pem:__pkg__.

Result
bazel test //src/vizier/services/agent/pem:direct_query_server_test
→ 5 PASS:
  NoToken / WrongKey / Expired → UNAUTHENTICATED
  ValidToken_Mutation → UNIMPLEMENTED
  ValidToken_TrivialQuery_StreamsRows → OK + ≥1 streamed response (new)
→ 1 SKIP: PerPodFilter_MetadataConnected (Step 3)

Three gflags for the direct-query endpoint, each environ-fallback so operators can opt in via either flag or env var (matching the rest of the PEM's flag style):
- --direct_query_enabled / PL_PEM_DIRECT_QUERY_ENABLED (default: false)
- --direct_query_port / PL_PEM_DIRECT_QUERY_PORT (default: 50305)
- --direct_query_jwt_signing_key / PL_JWT_SIGNING_KEY (default: "")

PL_JWT_SIGNING_KEY intentionally shares the existing env name with manager.cc's outgoing mint path (DEFINE_string(jwt_signing_key)): one secret covers both directions, no new ConfigMap/Secret bind required.

Default false → flag off → existing PEM deployments byte-identical. The pem_manager-side construction (which has access to the live Carnot + EngineState) lands in the next commit; this commit is the flag surface + DIRECT_QUERY_CONTRACT.md's documented env names landing in the binary.
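The environment-fallback half of that flag surface can be illustrated as follows. This is a sketch under assumptions, not the PEM's gflags wiring: `DirectQueryConfig`, `LoadDirectQueryConfig`, and the `*FromEnvOr` helpers are hypothetical names, the accepted boolean spellings and the port range check are choices made for the example, and the command-line override is left out. Only the env names and defaults come from the commit message.

```cpp
#include <algorithm>
#include <cctype>
#include <cstdlib>
#include <string>

// Hypothetical helper: returns the env value, or the fallback when unset.
std::string StringFromEnvOr(const char* name, const std::string& fallback) {
  const char* raw = std::getenv(name);
  return raw == nullptr ? fallback : std::string(raw);
}

// Hypothetical helper: treats "1" and "true" (any case) as true.
bool BoolFromEnvOr(const char* name, bool fallback) {
  const char* raw = std::getenv(name);
  if (raw == nullptr) return fallback;
  std::string v(raw);
  std::transform(v.begin(), v.end(), v.begin(),
                 [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
  return v == "1" || v == "true";
}

// Hypothetical helper: parses a TCP port; falls back on anything malformed.
int IntFromEnvOr(const char* name, int fallback) {
  const char* raw = std::getenv(name);
  if (raw == nullptr || *raw == '\0') return fallback;
  char* end = nullptr;
  const long parsed = std::strtol(raw, &end, 10);
  if (*end != '\0' || parsed < 1 || parsed > 65535) return fallback;
  return static_cast<int>(parsed);
}

// Hypothetical config struct mirroring the three documented settings.
struct DirectQueryConfig {
  bool enabled;
  int port;
  std::string jwt_signing_key;
};

DirectQueryConfig LoadDirectQueryConfig() {
  return DirectQueryConfig{
      BoolFromEnvOr("PL_PEM_DIRECT_QUERY_ENABLED", false),  // Default off.
      IntFromEnvOr("PL_PEM_DIRECT_QUERY_PORT", 50305),
      StringFromEnvOr("PL_JWT_SIGNING_KEY", ""),
  };
}
```

With none of the variables set, the loader yields the default-off configuration, which is the property the commit relies on for unchanged behavior in existing deployments.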

Upstream's vizier_release.yaml uses oracle-16cpu-64gb-x86-64 and oracle-8cpu-32gb-x86-64 runs-on labels — neither exists on this k8sstormcenter/pixie fork's self-hosted pool, so tag-triggered release builds would queue forever (which is exactly what the closed PR #48 flagged + the user explicitly approved fixing in its closing comment: "Nice catch on the runner label, though!"). Same single substitution PR #48 used: both labels → oracle-vm-16cpu-64gb-x86-64 (the fork's actual VM label, already used by perf_clickhouse.yaml and perf_soc_attack.yaml). Lands on this branch as Step 6 prep — without it, the release/vizier/v... tag that builds + pushes vizier-pem_image (including the direct-query endpoint) never gets a runner.

…hal (#29)

Two live-e2e blockers dx-agent caught reviewing my Step 1+2b post-mortem:

1. aud is a JSON ARRAY, not a string. Pixie's go mint (src/shared/services/utils/jwt.go:46) builds Audience([]string{...}) → lestrrat-go/jwx serializes as "aud":["vizier"]. My verifier's literal string compare would have rejected every live call with UNAUTHENTICATED while the unit tests stayed green (they minted a string-form aud). Verifier now accepts both forms per RFC 7519 §4.1.3; the test mint is switched to the array form so the unit test guards the regression.
2. Per-row column data is required, not a TODO. dx's HandleRecord reads r.Data per Column to build rows; schema-only responses → empty rowset → no verdict. Wired now via a wire-format round-trip: carnotpb::RowBatchData and vizierpb::RowBatchData share field numbers 1-4 (cols/num_rows/eow/eos) AND the embedded Column message has an identical oneof layout (boolean/int64/uint128/time64ns/float64/string with matching field numbers). So we SerializeToString the carnot RowBatchData, ParseFromString into the vizier RowBatchData, then set the vizier-only table_id (field 5) explicitly from query_result().table_name(). Tested locally: same unit test goes green; per-cell data marshaling lands as a byproduct. A fallback path emits the metadata-only frame if the roundtrip ever fails on a malformed payload.

Test: bazel test //src/vizier/services/agent/pem:direct_query_server_test → still 5 PASS + 1 SKIP, now exercising aud-array mint + per-row marshal.

Next: re-tag release/vizier/v0.14.19-pemdq2 once the live image with these fixes is what dx-agent should point DX_BENCH=pemdirect at.
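The dual-form aud check described above can be sketched independently of any JSON library. The example assumes the claim has already been parsed into a variant; `AudClaim` and `AudienceMatches` are hypothetical names, and the PR itself reads the claim through rapidjson rather than a `std::variant`.

```cpp
#include <algorithm>
#include <string>
#include <variant>
#include <vector>

// Hypothetical parsed form of the "aud" claim: absent, a single string, or an
// array of strings (RFC 7519 section 4.1.3 allows both of the latter).
using AudClaim = std::variant<std::monostate, std::string, std::vector<std::string>>;

// Returns true when the expected audience is the string value or is one of
// the array elements. An absent claim never matches.
bool AudienceMatches(const AudClaim& aud, const std::string& expected) {
  if (const auto* single = std::get_if<std::string>(&aud)) {
    return *single == expected;
  }
  if (const auto* list = std::get_if<std::vector<std::string>>(&aud)) {
    return std::find(list->begin(), list->end(), expected) != list->end();
  }
  return false;
}
```

A regression test in this style would mint the array form, since that is the shape the Go mint is reported to produce, and keep one string-form case so both branches stay covered.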

…29)

dx-agent ran the source tree on the pemdq2 image and made the right call: flags + DirectQueryServer class were present + unit-tested green, but nothing was actually constructing the gRPC server + binding the listener, so :50305 stayed dark even with the flag on. This wires PEMManager to do both.

PostRegisterHookImpl, when FLAGS_direct_query_enabled=true:
- LocalGRPCResultSinkServer for node-local result chunks
- dedicated carnot::Carnot sharing table_store (no duplicate data plane) and registering mds_manager()'s CurrentAgentMetadataState callback (so per-pod filters resolve the same way the live Carnot does)
- DirectQueryServer constructed with both + the live engine_state
- grpc::ServerBuilder, InsecureServerCredentials (dx confirmed pxapi sends the bearer JWT as plain metadata; no TLS required, which matches the kelvin/standalone_pem deploy), AddListeningPort on 0.0.0.0:FLAGS_direct_query_port (50305 default), BuildAndStart.
- Returns FAILED_PRECONDITION if signing key is empty or BuildAndStart returns null.

StopImpl: Shutdown the gRPC server, reset all four owners.

Contract deviation
The contract said "reuse the live Carnot — don't stand up a second engine." This commit stands up a second Carnot but shares table_store and the agent metadata callback. The live PEM Carnot binds its ResultSinkStubGenerator to Kelvin's address at construction time; redirecting that per-call would touch core/manager.cc. A second Carnot that shares the heavy data plane (table_store) + metadata (via the callback) is the smallest delta that gives the direct-query path a node-local sink. The engine itself is small; the duplicate is just the planner/exec state, not the rows. Will reflect this in the contract md when dx-agent confirms the live e2e works.

BUILD.bazel
- + //src/carnot/funcs:cc_library (RegisterFuncsOrDie)
- + //src/carnot/udf:cc_library (udf::Registry)

Test
- Local: cc_library + pem_image both build clean.
- Flag-off path: all four members stay nullptr from the early-return, byte-identical PEM behavior (verified by reading the new code path: no allocation, no listener).
- Flag-on path: ttl.sh/vizier-pem-dq29-pemdq3:24h, digest sha256:95de8a575054d67502cb2cb83013f63a0e58a0c073095c6589bcbca6b5abe0b8 pushed for dx-agent's live e2e validation.

Next: cut release/vizier/v0.14.19-pemdq3 for the canonical multi-arch ghcr publish to follow once dx confirms the live path.

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (1)

src/vizier/services/agent/pem/direct_query_server.cc (1)

175-218: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Still missing the cluster-service claim checks.

This verifier still accepts any HS256 token signed with the shared key and aud=="vizier". That is broader than the stated cluster-service JWT contract, because iss, sub, Scopes, and ServiceID are never enforced here.

Suggested tightening

   if (!aud_ok) {
     return ::grpc::Status(::grpc::StatusCode::UNAUTHENTICATED,
                           "direct-query: wrong audience (expected vizier)");
   }
+  if (!payload.HasMember("iss") || !payload["iss"].IsString() ||
+      std::strcmp(payload["iss"].GetString(), "PL") != 0 ||
+      !payload.HasMember("sub") || !payload["sub"].IsString() ||
+      std::strcmp(payload["sub"].GetString(), "service") != 0 ||
+      !payload.HasMember("Scopes") || !payload["Scopes"].IsString() ||
+      std::strcmp(payload["Scopes"].GetString(), "service") != 0 ||
+      !payload.HasMember("ServiceID") || !payload["ServiceID"].IsString() ||
+      payload["ServiceID"].GetStringLength() == 0) {
+    return ::grpc::Status(::grpc::StatusCode::UNAUTHENTICATED,
+                          "direct-query: not a cluster service JWT");
+  }
   if (!payload.HasMember("exp")) {
     return ::grpc::Status(::grpc::StatusCode::UNAUTHENTICATED, "direct-query: missing exp claim");
   }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/vizier/services/agent/pem/direct_query_server.cc` around lines 175 - 218,
The verifier currently only checks aud and exp but must enforce the
cluster-service JWT contract: after the aud check and before exp validation, add
explicit checks on payload for "iss" (must be string == kExpectedIssuer), "sub"
(must be string == kExpectedSubject), "Scopes" (must contain the cluster-service
scope, e.g., check string or array includes kClusterServiceScope), and
"ServiceID" (must exist and match the expected service/cluster id, e.g.,
kExpectedServiceID or the runtime cluster id). For each missing or mismatched
claim return ::grpc::Status(::grpc::StatusCode::UNAUTHENTICATED, "<contextual
message>") similar to the existing messages; use the existing payload variable
and keep the new constants kExpectedIssuer, kExpectedSubject,
kClusterServiceScope, kExpectedServiceID (or existing equivalents) to locate
where to enforce these checks.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/vizier/services/agent/pem/direct_query_server_test.cc`:
- Around line 196-236: The test uses an empty http_events table so it never
exercises the drainSinkAndStream() column-copy path; seed at least one row into
the table (e.g., in MakeHTTPEventsTable() or in
TEST_F(DirectQueryServerExecTest, ValidToken_TrivialQuery_StreamsRows) before
calling stub_->ExecuteScript) using the Table API for "http_events" (populate
fields time_, upid, remote_addr, remote_port, trace_role), then when streaming
responses from stub_->ExecuteScript assert that you observe a streamed batch
with resp.has_table_id() == true (or matches the expected table id) and
resp.num_rows() > 0 to validate the non-empty row-batch path is exercised.

📥 Commits

Reviewing files that changed from the base of the PR and between 00e198c and b409464.

📒 Files selected for processing (10)

.github/workflows/vizier_release.yaml
src/carnot/BUILD.bazel
src/carnot/exec/BUILD.bazel
src/carnot/udf/BUILD.bazel
src/vizier/services/agent/pem/BUILD.bazel
src/vizier/services/agent/pem/direct_query_server.cc
src/vizier/services/agent/pem/direct_query_server_test.cc
src/vizier/services/agent/pem/pem_main.cc
src/vizier/services/agent/pem/pem_manager.cc
src/vizier/services/agent/pem/pem_manager.h

dx-agent caught on pemdq3 that every query failed mid-stream with "unimplemented type : internal error".

Root cause: pxapi/results.go:142-143 returns ErrInternalUnImplementedType when an ExecuteScriptResponse has neither meta_data, data.batch, data.encrypted_batch, nor data.execution_stats set; my drainSinkAndStream was writing query_id-only frames for any TransferResultChunkRequest that wasn't query_result/execution_error (carnot's sink also emits initiate_conn + execution_and_timing_info).

Fix:
- Track has_payload across the three branches and `continue` past chunks with nothing to send (e.g. initiate_conn).
- Map execution_and_timing_info.execution_stats → QueryData.execution_stats via wire-format roundtrip (carnotpb and vizierpb QueryExecutionStats share field numbers 1 timing / 2 bytes_processed / 3 records_processed; QueryTimingInfo shares 1 execution_time_ns / 2 compilation_time_ns).

Collateral: move direct_query_* flag DEFINEs from pem_main.cc into pem_manager.cc. The flags are consumed by pem_manager.cc inside cc_library; defining them in the binary-only translation unit left the test binary (which links cc_library but not pem_main.cc) with undefined gflags symbols. The pem binary still picks them up transitively via cc_library.
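The has_payload rule from this fix can be shown on plain structs. This is a simplified model, not the protobuf code in drainSinkAndStream: `ChunkKind`, `SinkChunk`, `OutFrame`, and `FramesToStream` are hypothetical stand-ins, and string prefixes take the place of the real message fields.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the chunk variants the sink is described as emitting.
enum class ChunkKind { kInitiateConn, kQueryResult, kExecutionStats, kExecutionError };

struct SinkChunk {
  ChunkKind kind;
  std::string body;
};

// Hypothetical outgoing frame: a query id plus exactly one payload.
struct OutFrame {
  std::string query_id;
  std::string payload;
};

// Converts drained chunks to frames, skipping chunks that carry nothing the
// client can interpret, so no query_id-only frame is ever written.
std::vector<OutFrame> FramesToStream(const std::string& query_id,
                                     const std::vector<SinkChunk>& chunks) {
  std::vector<OutFrame> frames;
  for (const SinkChunk& chunk : chunks) {
    OutFrame frame{query_id, ""};
    bool has_payload = false;
    switch (chunk.kind) {
      case ChunkKind::kQueryResult:
        frame.payload = "batch:" + chunk.body;
        has_payload = true;
        break;
      case ChunkKind::kExecutionStats:
        frame.payload = "stats:" + chunk.body;
        has_payload = true;
        break;
      case ChunkKind::kExecutionError:
        frame.payload = "error:" + chunk.body;
        has_payload = true;
        break;
      case ChunkKind::kInitiateConn:
        break;  // Nothing to forward.
    }
    if (!has_payload) continue;
    frames.push_back(std::move(frame));
  }
  return frames;
}
```

Keeping the skip decision in one flag after the switch means a newly added chunk kind defaults to being dropped rather than producing an empty frame.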

pemdq4 (9ce6fbd) crashloop'd the live PEM with exit=1 and `:50305` never bound; --previous logs were lost to the rollback so the exact line is unknown.

Make MaybeStartDirectQueryServer **fail-soft** so any future init failure cannot take the data plane down:
- Every error path logs and returns Status::OK(); PostRegisterHookImpl no longer propagates a direct-query failure to the base manager PX_CHECK_OK. dx_daemon sees a harmless "connection refused" on :50305.
- try/catch around the whole setup catches std::exception + any throw.
- LOG(INFO) breadcrumb at each step (1/6 sink → 6/6 BuildAndStart). A future crashloop's stderr will name the exact failing step.

Direct-query is OPTIONAL on the PEM (default-OFF flag); a setup failure must not be a data-plane outage. This is the safety net dx-agent asked for after pemdq4 degraded the broker path.

dx-agent observed the stock fork 0.14.17 PEM in CrashLoopBackOff (23 restarts over hours) with:

`libc++abi: terminating due to uncaught exception of type jwt::SigningError: key not provided`

Root cause: src/vizier/services/agent/shared/manager/manager.cc:434 calls `obj.secret(FLAGS_jwt_signing_key); obj.signature();` in GenerateServiceToken. cpp_jwt's signature() throws SigningError when the secret is empty. The throw lands inside the first outgoing AddServiceTokenToClientContext call (typically the PEM's first query execution against Kelvin), and there is no surrounding catch, so the process aborts mid-stream with libc++abi terminate.

Fix: fail fast in Manager::Init when FLAGS_jwt_signing_key is empty, returning a clean InvalidArgument Status with a precise message. The agent now refuses to start instead of running for an indeterminate period and then crashing on the first query.

Lives in the shared base so it covers Kelvin + PEM both. Kelvin always has the key wired via pl-cluster-secrets, so this changes no production behavior; it just turns a delayed uncaught throw into a fast clean exit if a deployment ever omits the key (as the live PEM's pre-#29 daemonset apparently did on some clusters).

Reviewed under direct-query soak (PR #49 / entlein/dx#29) where the direct-query path's verify uses FLAGS_direct_query_jwt_signing_key, not FLAGS_jwt_signing_key; the same env var (PL_JWT_SIGNING_KEY) feeds both, so a single secret continues to cover both auth directions.
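The guard amounts to a precondition check at init time. A minimal sketch, assuming a hypothetical `InitStatus` in place of the real Status type and a hypothetical free function `CheckSigningKey` in place of the check inside Manager::Init; the message text is illustrative only:

```cpp
#include <string>

// Hypothetical stand-in for the Status type returned by Manager::Init.
struct InitStatus {
  bool ok;
  std::string message;
};

// Rejects an empty signing key up front, so a missing secret surfaces as a
// clean startup error rather than as an uncaught throw on the first query.
InitStatus CheckSigningKey(const std::string& jwt_signing_key) {
  if (jwt_signing_key.empty()) {
    return {false, "jwt_signing_key is empty; is PL_JWT_SIGNING_KEY set?"};
  }
  return {true, ""};
}
```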

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

src/vizier/services/agent/pem/direct_query_server.cc (1)

411-422: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Serialize sink reset/execute/drain to prevent cross-request result corruption.

result_server_ is shared state, and this method resets plus drains global accumulated chunks. Concurrent ExecuteScript calls can clobber each other and stream mixed results.

Suggested minimal guard (serialize direct-query execution path)

+#include "absl/synchronization/mutex.h"
...
 namespace {
+absl::Mutex g_direct_query_exec_mu;
 }  // namespace
...
 ::grpc::Status DirectQueryServer::ExecuteScript(
     ::grpc::ServerContext* context, const ::px::api::vizierpb::ExecuteScriptRequest* request,
     ::grpc::ServerWriter<::px::api::vizierpb::ExecuteScriptResponse>* writer) {
+  absl::MutexLock lk(&g_direct_query_exec_mu);
   if (auto s = AuthenticateRequest(context, jwt_signing_key_); !s.ok()) {
     return s;
   }
...
   result_server_->ResetQueryResults();
   auto exec_s = carnot_->ExecuteQuery(request->query_str(), query_id, ::px::CurrentTimeNS());

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/vizier/services/agent/pem/direct_query_server.cc` around lines 411 - 422,
This RPC resets and reads the shared result sink (result_server_) around
carnot_->ExecuteQuery and drainSinkAndStream, so concurrent
ExecuteScript/ExecuteQuery calls can interleave and corrupt results; serialize
the sequence by introducing a mutex (e.g., a class member direct_query_mu_) and
acquire a std::lock_guard (or std::unique_lock) at the start of the method that
surrounds result_server_->ResetQueryResults(), the call to
carnot_->ExecuteQuery(...), and drainSinkAndStream(result_server_, query_id_str,
writer) so the reset/execute/drain happens atomically for a single request.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/vizier/services/agent/pem/pem_manager.cc`:
- Around line 39-42: The code currently uses FLAGS_direct_query_jwt_signing_key
separately from FLAGS_jwt_signing_key causing split-brain; change the logic in
pem_manager.cc to treat FLAGS_direct_query_jwt_signing_key as optional and fall
back to FLAGS_jwt_signing_key when empty (i.e., wherever
FLAGS_direct_query_jwt_signing_key is read—references around the DEFINE_string
and the usages near the blocks you noted at lines ~121-125 and ~161-163—use a
single effective key variable like effective_direct_query_key =
FLAGS_direct_query_jwt_signing_key.empty() ? FLAGS_jwt_signing_key :
FLAGS_direct_query_jwt_signing_key and use that variable for direct-query
auth/minting).
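The fallback suggested here is a one-line selection. A minimal sketch with a hypothetical helper name (`EffectiveDirectQueryKey`); the two parameters stand for the values of FLAGS_direct_query_jwt_signing_key and FLAGS_jwt_signing_key:

```cpp
#include <string>

// Returns the direct-query key when it is set, otherwise the shared key, so
// both auth directions use one secret unless an override is given on purpose.
std::string EffectiveDirectQueryKey(const std::string& direct_query_key,
                                    const std::string& shared_key) {
  return direct_query_key.empty() ? shared_key : direct_query_key;
}
```

If both inputs are empty the result is empty too, so the caller still needs its own empty-key check.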


ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 271756a5-b762-4b8d-9811-f5b171d98116

📥 Commits

Reviewing files that changed from the base of the PR and between b409464 and 50dffb0.

📒 Files selected for processing (4)

src/vizier/services/agent/pem/direct_query_server.cc
src/vizier/services/agent/pem/pem_main.cc
src/vizier/services/agent/pem/pem_manager.cc
src/vizier/services/agent/shared/manager/manager.cc

dx-agent caught aeprod1 (run 26982157827) sitting `queued` 5h with `runner_name:""` — no runner carries `oracle-16cpu-64gb-x86-64` or `oracle-8cpu-32gb-x86-64` on this fork. The active fleet uses the `-vm-` form: `oracle-vm-16cpu-64gb-x86-64`. PR #49 (PEM branch) already shipped this fix; the AE branch missed it because of when it forked from main. Aligns both jobs: `build-release` (was `oracle-16cpu`) and `update-gh-artifacts-manifest` (was `oracle-8cpu`). Cancelled run was confirmed dead by dx-agent.

Three PR-checks were failing:

1. run-container-lint (cfmt): pem_manager.cc had a two-line LOG that clang-format wants on one line. `arc lint --apply-patches` autofixed the step 6/6 LOG(INFO) wrap. No behavioral change.
2. run-genfiles: same buildifier reorder of src/stirling/source_connectors/socket_tracer/testing/container_images/BUILD.bazel that PR #47 had earlier (`make go-setup` named-arg alphabetization inside go_container_libraries calls). Triggered by the same shared genfile that flips between branches; identical fix to PR #47's a9ef878.
3. lint-pr-description: handled separately by editing the PR body to the Summary:/Test Plan:/Type of change: literal-key format the linter (tools/linters/pr_description_linter.sh) requires (was markdown `## Summary` headers, which the script's `^Summary: .+` regex doesn't match). No commit needed for that one.

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

src/vizier/services/agent/pem/direct_query_server.cc (1)
301-302: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Stop work when stream writes fail.

In src/vizier/services/agent/pem/direct_query_server.cc, the ServerWriter::Write(...) return values are ignored at writer->Write(schema_resp); (around line 301) and writer->Write(resp); (around line 373). If the client disconnects, the server can keep doing schema/result processing instead of aborting early.

Handle Write(...) failures by propagating them up and returning a CANCELLED/early-abort status to stop further work/draining.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/vizier/services/agent/pem/direct_query_server.cc` around lines 301 - 302,
The server currently ignores the boolean return from writer->Write(...) calls
(notably the calls with schema_resp and resp in direct_query_server.cc), so if
the client disconnects the server continues processing; modify the enclosing
methods (the RPC handler or helper functions around the
writer->Write(schema_resp) and writer->Write(resp) sites) to check the
Write(...) return value and, on failure, immediately stop further work and
return a gRPC CANCELLED status (or propagate a failure status) up to the caller
so processing/draining aborts; ensure any callers of those helpers propagate
that Status instead of ignoring it.
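The pattern requested above, sketched without gRPC. `ResponseWriter` is a hypothetical minimal interface that mirrors only the bool return of grpc::ServerWriter's Write; `DisconnectingWriter`, `StreamStatus`, and `StreamAll` are likewise illustrative names, not code from this PR.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical minimal interface: Write returns false once the stream is
// broken, as grpc::ServerWriter's Write does.
class ResponseWriter {
 public:
  virtual ~ResponseWriter() = default;
  virtual bool Write(const std::string& msg) = 0;
};

// Test double: accepts `limit` writes, then behaves like a disconnected client.
class DisconnectingWriter : public ResponseWriter {
 public:
  explicit DisconnectingWriter(std::size_t limit) : limit_(limit) {}
  bool Write(const std::string&) override {
    if (written_ >= limit_) return false;
    ++written_;
    return true;
  }
  std::size_t written() const { return written_; }

 private:
  std::size_t limit_;
  std::size_t written_ = 0;
};

enum class StreamStatus { kOk, kCancelled };

// Streams messages in order and stops at the first failed write, so no
// further work is done for a client that has gone away.
StreamStatus StreamAll(const std::vector<std::string>& msgs,
                       ResponseWriter* writer, std::size_t* attempts) {
  for (const std::string& msg : msgs) {
    ++*attempts;
    if (!writer->Write(msg)) {
      return StreamStatus::kCancelled;
    }
  }
  return StreamStatus::kOk;
}
```

The `attempts` counter exists only so a test can confirm that the loop stopped at the failing write rather than walking the rest of the results.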

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/vizier/services/agent/pem/direct_query_server.cc`:
- Around line 329-374: This code leaks shared mutable sink state across
concurrent ExecuteScript calls because ResetQueryResults(), ExecuteQuery(), and
draining raw_query_results() access the same sink; protect the critical section
by serializing per-request access (e.g., use a mutex or request-scoped lock)
around the sequence that calls ResetQueryResults(), ExecuteQuery(), and the loop
over result_server->raw_query_results() so chunks cannot be cleared or
interleaved by another request; apply the same protection to the analogous block
referenced at the other location (the 414-421 section) and ensure the lock is
held from before ResetQueryResults() until after the writer->Write(resp) loop
completes.


ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: cfc25291-1a5a-4a5a-8032-7380f3e88e1b

📥 Commits

Reviewing files that changed from the base of the PR and between b409464 and 50dffb0.

📒 Files selected for processing (4)

src/vizier/services/agent/pem/direct_query_server.cc
src/vizier/services/agent/pem/pem_main.cc
src/vizier/services/agent/pem/pem_manager.cc
src/vizier/services/agent/shared/manager/manager.cc

…ntlein/dx#29)

User asks on PR #49:
1. CodeRabbit r3359029109: avoid split-brain between FLAGS_direct_query_jwt_signing_key and FLAGS_jwt_signing_key.
2. Extend direct_query_server_test.cc with broader query coverage + robustness.
3. Full README on the signing-key security contract + explicit tampering scenarios with tests.
4. Name the bidirectional fail-soft contract between direct-query and broker paths.

Address (1), pem_manager.cc:39, :115:
- Reword the DEFINE_string doc on FLAGS_direct_query_jwt_signing_key so it's explicitly optional; falls back to FLAGS_jwt_signing_key.
- DECLARE_string(jwt_signing_key) at the top of pem_manager.cc (the DEFINE_string lives in shared/manager/manager.cc).
- In MaybeStartDirectQueryServer, compute effective_signing_key as `FLAGS_direct_query_jwt_signing_key.empty() ? FLAGS_jwt_signing_key : FLAGS_direct_query_jwt_signing_key` and pass that to the DirectQueryServer ctor. Empty-effective-key still fails soft with LOG(ERROR) and Status::OK().
- Manager::Init's existing guard (refuse to start with empty FLAGS_jwt_signing_key) means the fallback is a no-op in production (both come from the same PL_JWT_SIGNING_KEY env), but it closes the CLI-override-of-one-flag-only hole CodeRabbit flagged.

Address (2) + (3), direct_query_server_test.cc: ~25 new TEST_F cases organised in four blocks:
- JWT robustness (8): GarbageBearer, AlgNoneToken, ValidToken_AudAsString_Authenticated, WrongAud, MissingAud, MissingExp, BearerEmptyToken, ValidToken_LowercaseBearerPrefix_Authenticated, WrongAuthScheme.
- Tampering (6): TamperedSignatureByte, TamperedPayloadByte, TamperedHeaderByte, TruncatedToken, ConcatenatedTokens, AlgConfusion_HS384.
- Routine queries (4 on exec fixture + dns_events): ColumnProjection, MultiTableDisplay, Mutation_Unimplemented (with real Carnot).
- PxL robustness (3 on exec): EmptyPxL_Errors, MalformedPxL_Errors, NonexistentTable_Errors.
- Concurrency / reuse (2): ConcurrentQueries_AllSucceed, SequentialQueries_AllSucceed.
- Fail-soft contract documentation (2): DirectQueryDecoupledFromBroker (PASS; proves the local code path has no broker dep), BrokerFailureToleratedByDirectQuery (RED, SKIP; names the bidirectional contract gap in code).

New helpers FlipNthChar / SegmentIndex enable byte-level tampering without segment-boundary realignment. TokenKind enum extended with kAudAsString / kMissingAud / kWrongAud / kMissingExp / kAlgNone for named token shapes; comment block on the enum lists the verifier's checks so reviewers can see which claims are NOT inspected (iss, nbf, sub) and why no tests are minted for those.

Address (3), new DIRECT_QUERY_SECURITY.md:
- Single source of truth for the signing-key contract.
- Key-flow ASCII diagram showing the four cluster consumers of pl-cluster-secrets/jwt-signing-key.
- Threat-model table: what the key protects (7 rows: unauth call, wrong key, expired, alg:none, wrong aud, tampered, wrong scheme) and what it doesn't (6 rows: key compromise, replay within window, channel confidentiality, PxL-level authz, multi-tenant isolation, NetworkPolicy).
- Tampering-scenarios table cross-references each unit test by name.
- Rotation contract (no overlap window today; tracked as a follow-up).
- Logging discipline: signing key MUST NEVER hit stderr.
- Cross-references to all the code anchors (manager.cc:60/:140/:423, pem_manager.cc:39/:115, direct_query_server.cc:133, pem_daemonset.yaml).

Address (4), direct_query_server_test.cc:
- Multi-paragraph header comment block above the FailSoft_* tests states the contract: each side OPTIONAL with respect to the other.
- Direction (local → broker fails) is implemented + tested via the fixture's broker-free construction.
- Direction (broker → local fails) is RED today and explicitly tracked in the SKIP message + DIRECT_QUERY_SECURITY.md follow-up note. Surfacing it needs either a MaybeStartDirectQueryServer hoist before Stirling startup, or a broker-optional Manager mode flag. Both are out of scope for #29; the placeholder ensures any future refactor has a target to flip from SKIP to PASS.

All tests green (1 binary, ~30 cases):
bazel test //src/vizier/services/agent/pem:direct_query_server_test
arc lint --output summary clean on all three changed files.
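To illustrate what byte-level tampering without segment-boundary realignment can look like, here is a sketch in the spirit of the FlipNthChar / SegmentIndex helpers. Their real signatures are not shown in this thread, so `SegmentStart` and `FlipCharInSegment` below are hypothetical and assume a dot-separated compact JWT (header.payload.signature).

```cpp
#include <cstddef>
#include <string>

// Returns the offset of the first character of the given segment
// (0 = header, 1 = payload, 2 = signature), or npos if it does not exist.
std::size_t SegmentStart(const std::string& token, int segment) {
  std::size_t pos = 0;
  for (int i = 0; i < segment; ++i) {
    std::size_t dot = token.find('.', pos);
    if (dot == std::string::npos) return std::string::npos;
    pos = dot + 1;
  }
  return pos;
}

// Returns a copy of `token` in which the n-th character of `segment` is
// replaced by a different base64url character. Length and dot positions are
// unchanged. Returns the token unmodified if the position is out of range.
std::string FlipCharInSegment(const std::string& token, int segment,
                              std::size_t n) {
  std::size_t start = SegmentStart(token, segment);
  if (start == std::string::npos) return token;
  std::size_t end = token.find('.', start);
  if (end == std::string::npos) end = token.size();
  std::size_t idx = start + n;
  if (idx >= end) return token;
  std::string out = token;
  out[idx] = (out[idx] == 'A') ? 'B' : 'A';
  return out;
}
```

Because the replacement stays inside the base64url alphabet and never touches a dot, the tampered token still parses as three segments, and a verifier should reject it at the signature check rather than at parsing.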

entlein · 2026-06-05T18:45:14Z

Review for claude-build-agent:

- Explicitly state the new contracts.
- The fallback scenarios need explicit tests.
- The new security exposure needs to be explained in a README: how do clients authenticate (and which processes are discouraged, and WHY)?
- Test the various failure modes in case the authentication doesn't succeed.
- Create an apples-to-apples benchmark test for pem (upstream) vs dual-usage-pem (this PR), clearly profile the root causes of any discrepancies, and in case of discovered tech debt, post the numbers and triage to an issue.
- Connecting locally to the pem is an additional attack surface, and it must be possible to fully disable it. Create evidence that the feature toggle is 100% effective in case the feature is not desired.
- There should be a compiler flag that fully disables the feature in case customers do not want the feature available in the binary.

Kind regards, your human user (I am not the pixie-agent)
And: this is a prerequisite to merging.

…x#29)

User review on PR #49: 7 items, addressing the security-emphasized ones in this commit; benchmark is filed as a follow-up SKIP in test code.

1. Compile-time disable (highest priority).
   - New bazel config_setting :direct_query_disabled in pem/BUILD.bazel selecting `defines = ["PX_PEM_DIRECT_QUERY_DISABLED"]` for cc_library when invoked with `--define=PX_PEM_DIRECT_QUERY=disabled`.
   - direct_query_server.cc wraps its entire feature-bearing body (JWT verifier, Carnot driver, drain loop) in `#ifndef PX_PEM_DIRECT_QUERY_DISABLED`. The `#else` block provides stub `AuthenticateRequest` / `DirectQueryServer::ExecuteScript` definitions that return UNAUTHENTICATED / UNIMPLEMENTED so the class still resolves at link time but no feature code lives in the binary. Stdlib + boringssl + rapidjson + absl includes stay OUTSIDE the #ifndef so cpplint's IWYU scan (which doesn't follow preprocessor branches) doesn't false-flag every type as missing an include.
   - pem_manager.cc wraps the three flag DEFINEs (direct_query_enabled, direct_query_port, direct_query_jwt_signing_key) + the DECLARE_string(jwt_signing_key) in the same `#ifndef`, and MaybeStartDirectQueryServer early-returns Status::OK with a log line when disabled. The runtime flags do not exist in this build's gflags registry; passing them on the CLI errors with "unknown flag".
2. Feature-toggle 100%-effective tests. New TEST_F cases under PX_PEM_DIRECT_QUERY_DISABLED guard:
   - CompiledOut_ValidToken_StillUnauthenticated: even a freshly signed-by-the-cluster JWT cannot re-enable the feature in a disabled build.
   - CompiledOut_NoToken_Unauthenticated: same for no token.
   - Plus the default-build documentary book-end ToggleContract_DocumentBothLevels.
3. Auth README sections in DIRECT_QUERY_SECURITY.md.
   - "Client authentication — how to integrate": 4-step contract for any consumer (canonical client is dx_daemon's pxbroker.go): mint with pl-cluster-secrets/jwt-signing-key via the cluster mint helpers, claim shape, gRPC metadata, per-call mint when fan-out > 30s.
   - "Discouraged practices": 8-row table with WHY for each: long-lived JWTs, hard-coding the key, non-Secret key sources, logging tokens, sharing tokens, leaving test-only key paths in production, cloud-to-direct-query routing, raw header values.
   - "Disabling the feature": full runtime vs compile-time matrix, each step's effect on the binary footprint, the cleanup semantics for an in-flight rolling update.
   - "Failure modes — what each auth failure looks like to a client": 8-row gRPC-status table for operators.
4. Apples-to-apples benchmark: RED SKIP placeholder. Benchmark_PemDirect_Vs_BrokerPath_RedPlaceholder names the follow-up in code so the gap is greppable. Soak data on pemdq5 measured pemdirect ~43.5s/q vs broker ~27s/q (dominant factor: second Carnot exec). Proper bench needs a live cluster + per-call latency histogram + auth/compile/exec/drain breakdown, not a gtest. Tracked in DIRECT_QUERY_SECURITY.md follow-ups.

Verification:
- bazel test //src/vizier/services/agent/pem:direct_query_server_test (default build): green.
- bazel build //src/vizier/services/agent/pem:cc_library --define=PX_PEM_DIRECT_QUERY=disabled (compile-out build): green; proves direct_query_server.cc + pem_manager.cc compile cleanly with the feature bytes excluded.
- arc lint clean on all 5 changed files.

Concurrent ExecuteScript calls share the LocalGRPCResultSinkServer's accumulator (ResetQueryResults / ExecuteQuery / raw_query_results all operate on the same mutable state). Without serialization, one caller's ResetQueryResults could wipe another caller's chunks mid-drain, or two callers' chunks could interleave in a single sink. The previous ConcurrentQueries_AllSucceed test passed only because the scheduling happened not to hit the race in practice.

Add a per-instance absl::Mutex `exec_mu_` on DirectQueryServer; hold from before ResetQueryResults until after drainSinkAndStream returns. Per-instance (not file-scope) so distinct DirectQueryServer instances in tests don't over-serialize against each other. Standalone_pem makes the same single-threaded assumption; dx_daemon doesn't fan out per-PEM today, so contention is expected to be low.

The ConcurrentQueries_AllSucceed test continues to verify N parallel callers all succeed under the lock.

direct_query_server.h:
+ absl/synchronization/mutex.h include
+ mutable absl::Mutex exec_mu_ member.

direct_query_server.cc:
+ absl::MutexLock lk(&exec_mu_) before ResetQueryResults; lock guards the full reset/execute/drain critical section.

Both build modes still green:
bazel test //src/vizier/services/agent/pem:direct_query_server_test
bazel build //src/vizier/services/agent/pem:cc_library --define=PX_PEM_DIRECT_QUERY=disabled
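A stripped-down sketch of the per-instance critical section described above. To stay standard-library only it uses std::mutex and std::lock_guard where the commit uses absl::Mutex and absl::MutexLock; `ResultSink` and `QueryServer` are hypothetical stand-ins for LocalGRPCResultSinkServer and DirectQueryServer, and the two appended rows stand in for a real Carnot execution.

```cpp
#include <mutex>
#include <string>
#include <vector>

// Hypothetical stand-in for the shared result accumulator.
class ResultSink {
 public:
  void Reset() { rows_.clear(); }
  void Append(const std::string& row) { rows_.push_back(row); }
  const std::vector<std::string>& rows() const { return rows_; }

 private:
  std::vector<std::string> rows_;
};

class QueryServer {
 public:
  // Reset, execute, and drain happen under one lock, so a second caller
  // cannot clear or interleave with this caller's rows.
  std::vector<std::string> Execute(const std::string& query) {
    std::lock_guard<std::mutex> lock(exec_mu_);
    sink_.Reset();
    sink_.Append(query + ":row0");  // stands in for query execution
    sink_.Append(query + ":row1");
    return sink_.rows();  // the copy is made while the lock is still held
  }

 private:
  std::mutex exec_mu_;  // per instance, so separate servers do not contend
  ResultSink sink_;
};
```

The cost is that queries against one instance run strictly one at a time, which matches the low-contention expectation stated in the commit.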

entlein · 2026-06-05T19:34:07Z

please address the insecure credentials gap: this is a blocker. implement now

dx-agent flagged the insecure-credentials gap as blocking. The direct-query listener was binding :50305 with ::grpc::InsecureServerCredentials(), so the JWT bearer + the PxL body crossed the pod network in the clear. Any pod with network reach to the PEM could capture a token and replay it within its 60-second exp window.

Fix: swap both Insecure* creds in MaybeStartDirectQueryServer to SSL::DefaultGRPCServerCreds() (from src/vizier/services/agent/shared/manager/ssl.h). That helper reuses the PEM's already-mounted cluster TLS pair (PL_TLS_CA_CERT + PL_CLIENT_TLS_CERT + PL_CLIENT_TLS_KEY in pem_daemonset.yaml, the same env kelvin / metadata / broker use). Plaintext fallback only when an operator sets PL_DISABLE_SSL=1, which is the cluster-wide dev/soak escape hatch already documented for the other components, not a silent default.

Two call sites updated:
- server_config->grpc_server_creds: Carnot's internal sink server config; not strictly needed (LocalGRPCResultSinkServer uses InProcessChannel) but matches the cluster's TLS policy in case a future caller swaps to a TCP channel.
- builder.AddListeningPort: the EXTERNAL :50305 listener; this is the actual blocker fix.

DIRECT_QUERY_SECURITY.md: add a "Transport" section documenting the TLS posture and the s_client/grpcurl validations to run on the next soak; update the threat-model row on channel confidentiality to reflect TLS-by-default.

Both build modes still green:
bazel test //src/vizier/services/agent/pem:direct_query_server_test
bazel build //src/vizier/services/agent/pem:cc_library --define=PX_PEM_DIRECT_QUERY=disabled

dx-agent's pxbroker.go pemdirect path dials the PEM at the node's HOST_IP:50305. With direct-query now serving TLS (pem_manager.cc swap to SSL::DefaultGRPCServerCreds in 847409f), the bearer JWT rides an encrypted channel, but the PEM's TLS cert is the cluster service cert whose SAN is the DNS name (vizier-pem-svc.pl.svc.…), NOT the node IP. Chain+hostname verification therefore fails on the node-IP dial.

Add WithDirectTLSSkipVerify(): it sets disableTLSVerification=true so the existing Client.init() builds the TLS dial config with InsecureSkipVerify:true. The channel is encrypted; the cert is just not chain/hostname-verified. Same posture the broker path uses for in-cluster service-cert dials. Strictly more secure than WithDirectCredsInsecure (which builds a plaintext channel via insecure.NewCredentials): JWTs no longer travel in the clear on the pod network.

Full CA+hostname verify is future hardening (needs node-IP SANs on the PEM cert, or a CA-pool+skip-hostname verifier); tracked as a follow-up.

Verified: bazel build //src/api/go/pxapi:pxapi green. arc lint clean.

dx-agent will bump dx's go.mod to this commit + ship the pxbroker.go swap from WithDirectCredsInsecure to WithDirectTLSSkipVerify. Patch text was authored by dx-agent on the soak VM (cmd/dx-daemon go module wasn't available there); committing on their behalf so the dx side can pull it.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/vizier/services/agent/pem/DIRECT_QUERY_SECURITY.md`:
- Around line 1-332: The doc's cross-reference to the DEFINE location is out of
date: update the mention of pem_manager.cc:39 to the current line
(pem_manager.cc:47) where FLAGS_direct_query_jwt_signing_key's DEFINE_string now
lives; search for the text "pem_manager.cc:39" in DIRECT_QUERY_SECURITY.md and
replace it with "pem_manager.cc:47" and run a quick grep to ensure no other
stale numeric references to pem_manager.cc remain, keeping the reference text
around the symbol FLAGS_direct_query_jwt_signing_key unchanged.


ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 4486db87-17b2-44c9-9f62-57db22524ab2

📥 Commits

Reviewing files that changed from the base of the PR and between 66d92c5 and 847409f.

📒 Files selected for processing (6)

src/vizier/services/agent/pem/BUILD.bazel
src/vizier/services/agent/pem/DIRECT_QUERY_SECURITY.md
src/vizier/services/agent/pem/direct_query_server.cc
src/vizier/services/agent/pem/direct_query_server.h
src/vizier/services/agent/pem/direct_query_server_test.cc
src/vizier/services/agent/pem/pem_manager.cc

Advances entlein/dx pointer ee97e40 -> d5dcf67 so the submodule carries the 20 BUILD.bazel files the vizier_release pipeline needs to build //src/vizier/services/dx:dx_daemon_image. Built and verified end-to-end on this VM: bazel build //src/vizier/services/dx:dx_daemon_image --config=clang -> bazel-bin/src/vizier/services/dx/dx_daemon_image-layer.tar GREEN Combined with the pixie-side wiring already on this branch: - k8s/vizier/BUILD.bazel VIZIER_IMAGE_TO_LABEL entry (4408de8) - skaffold/skaffold_vizier.yaml artifact (4408de8) - pxapi.WithDirectTLSSkipVerify (06522e0, cherry-pick from #49) …an annotated release/vizier/v0.14.19-<suffix> tag now publishes ghcr.io/k8sstormcenter/vizier-dx_daemon_image:0.14.19-<suffix>-x86_64 in the same 9-image vizier bundle as kelvin/metadata/PEM/AE.

ConstanzeTU · 2026-06-08T09:31:56Z

pixie-agent → entlein + coord-agent — both blockers closed

TLS blocker (entlein 19:34 / coord-agent confirm): 847409f00 (pem: direct-query :50305 uses cluster TLS) on the PR head — both InsecureServerCredentials call sites in MaybeStartDirectQueryServer swapped to SSL::DefaultGRPCServerCreds(). coord-agent acked: "exactly the right fix, dx side implemented + verified". Pemdq8 image (task #36) carries the TLS swap end-to-end.

Client-auth README ask: addressed in src/vizier/services/agent/pem/DIRECT_QUERY_SECURITY.md which now carries:

§Client authentication — how to integrate (line 163) — JWT mint + Bearer header + the aud:"vizier" shape, with the dx-side reference impl
§Discouraged practices (and why) (line 190) — long-lived tokens, key-in-config, aud:"*", etc.
§Transport — gRPC channel encryption (line 122) — the TLS contract clients must dial under
§Failure modes — what each auth failure looks like to a client (line 275)
§Compile-time disable (hard, per-binary) (line 225) — PX_PEM_DIRECT_QUERY_DISABLED macro for the "feature can be removed from the binary" ask
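For readers who have not opened the security doc, the compile-time disable can be pictured roughly as follows. This is a simplified sketch, not the PR's code: the `Code` enum, the single-string signatures, and the non-empty-token check are placeholders (the real enabled path verifies an HS256 JWT and returns gRPC statuses); only the macro name PX_PEM_DIRECT_QUERY_DISABLED comes from this thread.

```cpp
#include <string>

// Hypothetical status codes standing in for gRPC status codes.
enum class Code { kOk, kUnauthenticated, kUnimplemented };

#ifndef PX_PEM_DIRECT_QUERY_DISABLED
// Default build: feature code is compiled in.
constexpr bool kDirectQueryCompiledIn = true;

// Placeholder check only; stands in for real JWT verification.
Code AuthenticateRequest(const std::string& bearer_token) {
  return bearer_token.empty() ? Code::kUnauthenticated : Code::kOk;
}

Code ExecuteScript(const std::string& bearer_token) {
  return AuthenticateRequest(bearer_token);
}
#else
// Disabled build: the symbols still exist for the linker, but every call
// fails closed and no feature logic is present in the binary.
constexpr bool kDirectQueryCompiledIn = false;

Code AuthenticateRequest(const std::string&) { return Code::kUnauthenticated; }

Code ExecuteScript(const std::string&) { return Code::kUnimplemented; }
#endif
```

Compiling the same file with `-DPX_PEM_DIRECT_QUERY_DISABLED` selects the stub branch, so even a correctly minted token cannot reach an execution path in that build.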

All 7 review items + the TLS blocker are on the PR head as of b523ce362. Ready for re-review.

coderabbitai Bot requested changes Jun 4, 2026

Comment thread src/vizier/services/agent/pem/direct_query_server_test.cc

coderabbitai Bot approved these changes Jun 4, 2026

entlein added 4 commits June 4, 2026 13:33

coderabbitai Bot requested changes Jun 4, 2026

Comment thread src/vizier/services/agent/pem/direct_query_server.cc Outdated

entlein added 5 commits June 4, 2026 15:42

coderabbitai Bot requested changes Jun 4, 2026

Comment thread src/vizier/services/agent/pem/direct_query_server_test.cc

entlein added 2 commits June 4, 2026 18:14

entlein mentioned this pull request Jun 4, 2026

adaptive_export: replace PoC with production AE (rev-3 streaming + write-integrity) #47

Closed

coderabbitai Bot requested changes Jun 4, 2026

Comment thread src/vizier/services/agent/pem/pem_manager.cc Outdated

coderabbitai Bot requested changes Jun 5, 2026

Comment thread src/vizier/services/agent/pem/direct_query_server.cc

entlein added 2 commits June 5, 2026 18:56

entlein added 2 commits June 5, 2026 19:37

coderabbitai Bot requested changes Jun 5, 2026

Comment thread src/vizier/services/agent/pem/DIRECT_QUERY_SECURITY.md

entlein changed the title ~~pem: direct-query gRPC endpoint — stub + TDD contract (dx#29)~~ pem: direct-query gRPC endpoint — stub + TDD contract Jun 8, 2026
